How To Protect Your Files From Robots
by Erika Lawal
Optimizing website pages for the search engines without running
into trouble at the very least causes most of us webmasters to
keep our brain cells finely honed, and at worst induces massive
migraines!
One of the most common challenges for us all is how to present
"clean", relevant and original content to a wide range of
visitors.
You may find that you want to exclude search engine and other
robots from all or part of your website for a number of reasons,
including:
- you want to write similar pages for different types of
visitors, but don't want to be penalized for duplication.
- you want to prepare pages or files that you don't want
viewed.
It's very easy to achieve this by one of two means: you can use
either a robots.txt file or a meta tag.
Let's de-mystify the process of writing these files and tags!
WHAT IS A ROBOTS.TXT FILE?
A robots.txt file is a set of instructions to the robots that
travel the web, spidering the pages they find there. It tells
them which parts of your site they may traverse, if any.
The robots.txt file we're considering here is an exclusion
instruction - think of it as a "no entry" sign to robots.
You can write a file to exclude ("disallow") robots from all, or
just part of your site.
Before you begin, you need to know how to write the .txt file.
Prepare it in a plain text editor such as Notepad. Don't attempt
it in Word or an HTML editor such as FrontPage. When you're
finished, save it as "robots.txt" (all lower case).
WHAT TO PUT IN YOUR ROBOTS.TXT FILE
If you want to disallow all robots, you'd write:
User-agent: *
Disallow: /
And that's all. Nothing else.
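If you'd like to check that a rule does what you expect before uploading it, Python's standard urllib.robotparser module can parse the file and report what a robot would be allowed to fetch. This is just a convenient testing aid of mine, not part of the method above; example.com stands in for your own domain:

```python
from urllib import robotparser

# The "no entry" file described above: exclude all
# robots from the entire site.
rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# With "Disallow: /", every path is off-limits to every robot.
print(rp.can_fetch("AnyBot", "http://example.com/"))           # False
print(rp.can_fetch("AnyBot", "http://example.com/page.html"))  # False
```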
What about if you only want to exclude part of your site?
Let's pretend you're running a website which advises on raising
children. Your material will be relevant to surfers who live in
many countries, but if you want them to really sit up and look,
especially if you want them to buy from you, you'll need to make
sure that your content is region-specific, including references,
idiom and spelling.
This situation is an ideal candidate for a robots exclusion .txt
file.
You've written all the pages you want to show to surfers in
Canada, UK, and Australia in 3 separate directories which
surfers will access by clicking on an appropriate link on your
main pages.
The directories are:
/ca/
/uk/
/au/
To disallow robots from these directories, write the following
.txt file:
User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/
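You can sanity-check this file the same way with Python's standard urllib.robotparser module (again, my testing aid rather than part of the article's method, with example.com as a stand-in domain):

```python
from urllib import robotparser

# The three regional directories are disallowed;
# everything else on the site remains open to robots.
rules = """\
User-agent: *
Disallow: /ca/
Disallow: /uk/
Disallow: /au/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "http://example.com/ca/index.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/uk/index.html"))  # False
print(rp.can_fetch("AnyBot", "http://example.com/index.html"))     # True
```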
It may be that you want to allow some robots and disallow
others. In our example, suppose you want to disallow just one
robot from one directory, in which case you'd write:
User-agent: NastyBot
Disallow: /ca/
Or, to exclude all robots except one, which you want to traverse
all of your site:
User-agent: NiceBot
Disallow:
User-agent: *
Disallow: /ca/
Note that a Disallow line with no slash after it means the robot
named in that record is permitted to read the whole site, and
"*" matches any robot. So in the last .txt file example, all
robots are excluded from your Canadian directory except NiceBot,
which can read the whole site.
Easy, isn't it?
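You can verify that the two-record file behaves as described, with NiceBot roaming freely while everyone else is kept out of /ca/, using the same urllib.robotparser check (a sketch, with example.com as a stand-in domain):

```python
from urllib import robotparser

rules = """\
User-agent: NiceBot
Disallow:

User-agent: *
Disallow: /ca/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# NiceBot matches its own record; the empty Disallow
# line permits the whole site.
print(rp.can_fetch("NiceBot", "http://example.com/ca/page.html"))   # True

# Every other robot falls through to the "*" record.
print(rp.can_fetch("OtherBot", "http://example.com/ca/page.html"))  # False
print(rp.can_fetch("OtherBot", "http://example.com/index.html"))    # True
```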
WHERE TO PUT YOUR ROBOTS.TXT FILE
Once created, your file needs to go into your root directory.
This is the same directory which contains your home page. Don't
put it anywhere else, because the robots won't see it.
Note that you can only have ONE robots.txt file per site, so any
modifications will need to be integrated into your original file.
Note also that disallowing pages in a robots.txt file means
robots won't read those pages, and so they won't be indexed, but
that won't matter if you've optimized your indexed pages
properly.
In our Ca/UK/Au example above, your traffic will find your
indexed global/US pages via the search engines, and will then
follow the link to their "nationality" page from their point of
entry to your site. We've all seen the little flag links on
other sites; just put up a flag graphic and say, for example,
"UK Visitors Click Here".
If you want to learn more about exclusion robots.txt files,
visit:
http://www.robotstxt.org/wc/exclusion-admin.html
If you prefer (or need) to exclude individual pages from being
viewed by robots, you can do this using a robots.txt file, but
you can also achieve it with a meta tag placed between the
<head> tags of your web page. The universal exclusion is as
follows:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
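If you want to confirm the tag is actually present in a page's head section, a small check with Python's standard html.parser module can pull out any robots meta directives. This is an illustration of mine, not something the article prescribes:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the CONTENT value of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lower-cased
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

# A minimal example page carrying the universal exclusion tag.
page = """<html><head>
<title>Example page</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head><body>Hello</body></html>"""

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['NOINDEX, NOFOLLOW']
```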
It may be that you want robots to index your pages, but not to
archive them. There is a range of reasons why you might not want
search engines to keep copies of old pages. The most prevalent
one among webmasters is that they are cloaking pages and don't
want it known that the page served to search engines is
different from the one seen by surfers, but it's also possible
to have perfectly "legitimate" reasons for wanting to keep parts
of your site from public scrutiny.
Whatever your reason, if you want to avoid your page being
archived, the universal tag is:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
For Google (whose cache feature makes it the search engine you
are most likely to want to stop archiving your pages), the tag
is:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
To learn more about exclusion meta tags, visit:
http://www.robotstxt.org/wc/exclusion.html#meta
Don't be put off by the jargon; writing these files and tags
is one of the easiest and most useful technical tasks you can
undertake as a webmaster - write a file today and save yourself
hundreds of hours!
================================================================
Erika Lawal writes Daily Internet Marketing Tips for webmasters
desperately in search of cutting edge site optimization and
marketing advice that produces results. Get a FREE series of
our Tips by visiting:
http://www.dailyinternetmarketingtips.com/spronews.html
================================================================